Levene’s Test (Homogeneity of Variances)#

Levene’s test is a hypothesis test for equal variances across two or more independent groups. It’s commonly used as an assumption check before methods that assume homoscedasticity (equal within-group variance), such as classical one-way ANOVA.


Learning goals#

By the end you should be able to:

  • state the null/alternative hypotheses and what rejecting means

  • explain the key trick: convert a variance problem into a mean problem

  • compute the Levene statistic from scratch using NumPy

  • interpret the output (test statistic, degrees of freedom, p-value)

  • use Plotly visuals to build intuition (raw samples vs deviation variables)

What it’s used for#

Levene’s test answers:

“Do these groups have the same spread/variability?”

Typical use-cases:

  • ANOVA / linear models: check whether residual variability looks similar across groups.

  • A/B/n tests: check if one variant causes outcomes to be more variable.

  • Quality control: check if batches/machines differ in variability.

  • Finance/ops: detect regime changes where volatility differs across segments.

Important:

  • This is a test about variances, not means. Groups can have different means and still have equal variances.

  • If you reject, Levene’s test does not tell you which groups differ; it only says “at least one variance differs”.

  • For two-sample mean comparisons, it’s often better to use Welch’s t-test directly rather than “pre-testing” variances.
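Following up on that last point, a minimal sketch of skipping the variance pre-test entirely (the data here are simulated; in SciPy, `equal_var=False` selects Welch’s t-test):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(loc=0.0, scale=1.0, size=50)   # variant A
b = rng.normal(loc=0.3, scale=2.0, size=50)   # variant B, more variable

# equal_var=False runs Welch's t-test: it does not assume equal variances,
# so there is no need to pre-test them with Levene.
t, p = stats.ttest_ind(a, b, equal_var=False)
print(f"Welch t={t:.3f}, p={p:.4f}")
```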

Hypotheses and interpretation#

For \(k\) groups with population variances \(\sigma_1^2, \dots, \sigma_k^2\):

  • Null hypothesis: \(H_0: \sigma_1^2 = \sigma_2^2 = \dots = \sigma_k^2\) (all variances equal)

  • Alternative: \(H_1\): at least one variance differs

Decision rule at significance level \(\alpha\) (e.g. 0.05):

  • if p-value < α: reject \(H_0\) → evidence that variances are not all equal

  • if p-value ≥ α: fail to reject \(H_0\) → not enough evidence of a variance difference

A p-value is not the probability that \(H_0\) is true. It is the probability (under \(H_0\)) of getting a test statistic at least as extreme as the one observed.
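To keep that phrasing straight, here is a small hypothetical helper (the `decide` function is ours, purely for illustration, not a standard API):

```python
def decide(p_value: float, alpha: float = 0.05) -> str:
    """Phrase the Levene decision: we 'fail to reject' H0, never 'accept' it."""
    if p_value < alpha:
        return "reject H0: evidence that the variances are not all equal"
    return "fail to reject H0: not enough evidence of a variance difference"

print(decide(0.003))
print(decide(0.42))
```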

The core idea (why Levene works)#

Levene’s test turns “equal variances?” into “equal means?” by constructing a new variable:

  1. For each group \(i\), pick a center (one of):

    • mean (original Levene)

    • median (Brown–Forsythe variant; more robust)

    • trimmed mean (robust)

  2. Compute absolute deviations from the group center:

\[z_{ij} = |x_{ij} - c_i|\]

If groups truly have the same variance, then the typical size of these deviations should be similar across groups.

So we run a one-way ANOVA on \(z_{ij}\):

\[W = \frac{\text{MS}_{\text{between}}}{\text{MS}_{\text{within}}} = \frac{\left(\sum_i n_i(\bar z_i-\bar z)^2\right)/(k-1)}{\left(\sum_i\sum_j (z_{ij}-\bar z_i)^2\right)/(N-k)}\]

Where \(n_i\) is group size, \(N=\sum_i n_i\), and \(\bar z_i\) are group means of the deviations.

Under \(H_0\), \(W\) is approximately \(F\)-distributed with \(\mathrm{df}_1 = k-1\) and \(\mathrm{df}_2 = N-k\).
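Before the general implementation below, the steps above can be traced on tiny hand-checkable data (toy numbers, chosen so every intermediate quantity is easy to verify by eye):

```python
import numpy as np

# Two toy groups: b is shifted and has roughly twice the spread of a.
a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = np.array([10.0, 12.0, 14.0, 16.0, 18.0])

# Step 1: group centers (median / Brown-Forsythe variant)
c_a, c_b = np.median(a), np.median(b)          # 3.0 and 14.0

# Step 2: absolute deviations z_ij = |x_ij - c_i|
z_a, z_b = np.abs(a - c_a), np.abs(b - c_b)    # [2,1,0,1,2] and [4,2,0,2,4]

# Step 3: one-way ANOVA F statistic on the deviations
n_a, n_b = a.size, b.size
N, k = n_a + n_b, 2
zbar_a, zbar_b = z_a.mean(), z_b.mean()        # 1.2 and 2.4
z_bar = (n_a * zbar_a + n_b * zbar_b) / N      # grand mean: 1.8

ss_between = n_a * (zbar_a - z_bar) ** 2 + n_b * (zbar_b - z_bar) ** 2  # 3.6
ss_within = ((z_a - zbar_a) ** 2).sum() + ((z_b - zbar_b) ** 2).sum()   # 14.0
W = (ss_between / (k - 1)) / (ss_within / (N - k))
print(round(W, 4))  # 2.0571
```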

import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio
from plotly.subplots import make_subplots

pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")

np.set_printoptions(precision=4, suppress=True)

SEED = 42
rng = np.random.default_rng(SEED)

print("numpy:", np.__version__)
print("pandas:", pd.__version__)
print("plotly:", __import__("plotly").__version__)
numpy: 1.26.2
pandas: 2.1.3
plotly: 6.5.2

From-scratch NumPy implementation#

Below is a low-level implementation of the Levene statistic. The goal is to make the mechanics explicit:

  • compute group centers \(c_i\)

  • convert samples to deviations \(z_{ij} = |x_{ij}-c_i|\)

  • compute the one-way ANOVA \(F\) statistic on the deviations

For the p-value, we show two options:

  • a NumPy-only Monte Carlo approximation using the \(F\) distribution definition as a ratio of chi-squares

  • (optional) a SciPy reference value (scipy.stats.f.sf) to match what you’d do in practice

def trimmed_mean(x: np.ndarray, proportiontocut: float) -> float:
    x = np.asarray(x, dtype=float).ravel()
    if not (0 <= proportiontocut < 0.5):
        raise ValueError("proportiontocut must be in [0, 0.5).")
    if x.size == 0:
        raise ValueError("Empty sample.")
    k = int(np.floor(proportiontocut * x.size))
    if 2 * k >= x.size:
        raise ValueError("Trim proportion too large for this sample size.")
    x_sorted = np.sort(x)
    return float(x_sorted[k : x.size - k].mean())


def _center(x: np.ndarray, center: str, proportiontocut: float) -> float:
    center = center.lower()
    if center == "mean":
        return float(np.mean(x))
    if center == "median":
        return float(np.median(x))
    if center in {"trimmed", "trimmed_mean"}:
        return trimmed_mean(x, proportiontocut)
    raise ValueError("center must be one of: 'mean', 'median', 'trimmed'.")


def levene_statistic(*groups: np.ndarray, center: str = "median", proportiontocut: float = 0.05):
    groups = [np.asarray(g, dtype=float).ravel() for g in groups]

    if len(groups) < 2:
        raise ValueError("Levene's test needs at least 2 groups.")
    if any(g.size < 2 for g in groups):
        raise ValueError("Each group must contain at least 2 observations.")
    if not all(np.isfinite(g).all() for g in groups):
        raise ValueError("All observations must be finite (no NaN/inf).")

    k = len(groups)
    n = np.array([g.size for g in groups], dtype=int)
    N = int(n.sum())
    df1 = k - 1
    df2 = N - k

    centers = np.array([
        _center(g, center=center, proportiontocut=proportiontocut) for g in groups
    ])

    z_groups = [np.abs(g - c) for g, c in zip(groups, centers)]
    z_means = np.array([z.mean() for z in z_groups])
    z_bar = float(np.sum(n * z_means) / N)

    ss_between = float(np.sum(n * (z_means - z_bar) ** 2))
    ss_within = float(np.sum([np.sum((z - m) ** 2) for z, m in zip(z_groups, z_means)]))

    if ss_within == 0.0:
        W = np.inf if ss_between > 0 else 0.0
    else:
        W = (ss_between / df1) / (ss_within / df2)

    details = {
        "centers": centers,
        "z_groups": z_groups,
        "z_means": z_means,
        "z_bar": z_bar,
        "ss_between": ss_between,
        "ss_within": ss_within,
        "df1": df1,
        "df2": df2,
    }
    return W, details


def f_pvalue_mc(W: float, df1: int, df2: int, n_mc: int = 200_000, seed: int = 0) -> float:
    """Right-tail p-value via NumPy-only Monte Carlo for an F(df1, df2) statistic."""
    if not np.isfinite(W):
        return 0.0
    if n_mc <= 0:
        raise ValueError("n_mc must be positive.")

    rng_local = np.random.default_rng(seed)
    f_sim = (rng_local.chisquare(df1, size=n_mc) / df1) / (rng_local.chisquare(df2, size=n_mc) / df2)
    return float(np.mean(f_sim >= W))


def levene_test_numpy(
    *groups: np.ndarray,
    center: str = "median",
    proportiontocut: float = 0.05,
    n_mc: int = 200_000,
    seed: int = 0,
):
    W, details = levene_statistic(*groups, center=center, proportiontocut=proportiontocut)
    p_mc = f_pvalue_mc(W, details["df1"], details["df2"], n_mc=n_mc, seed=seed)
    return W, p_mc, details


def plot_raw_vs_deviation(groups, group_names, centers, center_label: str):
    fig = make_subplots(
        rows=1,
        cols=2,
        subplot_titles=(
            "Raw samples",
            f"Absolute deviations |x - {center_label}(group)|",
        ),
        horizontal_spacing=0.15,
    )

    for name, x, c in zip(group_names, groups, centers):
        fig.add_trace(
            go.Violin(
                y=x,
                name=name,
                box_visible=True,
                meanline_visible=True,
                points="outliers",
            ),
            row=1,
            col=1,
        )
        fig.add_trace(
            go.Violin(
                y=np.abs(x - c),
                name=name,
                box_visible=True,
                meanline_visible=True,
                points="outliers",
                showlegend=False,
            ),
            row=1,
            col=2,
        )

    fig.update_layout(violinmode="group", height=420)
    fig.update_yaxes(title_text="value", row=1, col=1)
    fig.update_yaxes(title_text=f"|x - {center_label}(group)|", row=1, col=2)
    return fig


def plot_mean_abs_deviation(z_groups, group_names):
    mad = np.array([float(np.mean(z)) for z in z_groups])
    fig = go.Figure(go.Bar(x=group_names, y=mad, text=np.round(mad, 4)))
    fig.add_hline(
        y=float(mad.mean()),
        line_dash="dash",
        annotation_text="grand mean",
        annotation_position="top left",
    )
    fig.update_traces(textposition="outside")
    fig.update_layout(
        title="Group means of |x - center(group)| (what the ANOVA compares)",
        xaxis_title="group",
        yaxis_title="mean absolute deviation",
        height=360,
    )
    return fig


def plot_f_null_distribution(W: float, df1: int, df2: int, n: int = 150_000, seed: int = 0):
    rng_local = np.random.default_rng(seed)
    f_sim = (rng_local.chisquare(df1, size=n) / df1) / (rng_local.chisquare(df2, size=n) / df2)

    fig = go.Figure(go.Histogram(x=f_sim, nbinsx=80, histnorm="probability density"))
    fig.add_vline(
        x=W,
        line_color="crimson",
        line_width=3,
        annotation_text="observed W",
        annotation_position="top right",
    )
    fig.update_layout(
        title=f"Null reference: simulated F({df1}, {df2}) density + observed W",
        xaxis_title="F value",
        yaxis_title="density",
        height=360,
        showlegend=False,
    )
    return fig
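One code path the examples below never exercise is the trimmed-center option. As a standalone sanity check (the trimming logic is re-derived inline so this snippet runs on its own), `trimmed_mean` should agree with `scipy.stats.trim_mean`, which likewise drops `floor(p * n)` observations from each tail:

```python
import numpy as np
from scipy import stats

def trimmed_mean_check(x, proportiontocut):
    # Same logic as the trimmed_mean helper above: sort, then drop
    # floor(p * n) observations from each tail before averaging.
    x_sorted = np.sort(np.asarray(x, dtype=float).ravel())
    k = int(np.floor(proportiontocut * x_sorted.size))
    return float(x_sorted[k : x_sorted.size - k].mean())

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 1000.0])
print(trimmed_mean_check(x, 0.10))        # 5.5 (drops 1.0 and 1000.0)
print(float(stats.trim_mean(x, 0.10)))    # 5.5 as well
print(float(x.mean()))                    # 104.5, dragged up by the outlier
```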

Example 1: different means, similar variances#

Here the groups have noticeably different means, but we generate them with the same standard deviation. Levene’s test should typically fail to reject \(H_0\) (equal variances).

n = 160
g1 = rng.normal(loc=0.0, scale=1.0, size=n)
g2 = rng.normal(loc=2.0, scale=1.0, size=n)
g3 = rng.normal(loc=-1.0, scale=1.0, size=n)

groups = [g1, g2, g3]
names = ["A", "B", "C"]

W, p_mc, details = levene_test_numpy(*groups, center="median", n_mc=200_000, seed=1)
print(f"NumPy Levene (median center): W={W:.4f}, df=({details['df1']}, {details['df2']}), p_mc≈{p_mc:.4f}")

# Practical reference (analytic p-value)
from scipy.stats import f as f_dist, levene as levene_scipy

p_f = float(f_dist.sf(W, details["df1"], details["df2"]))
W_scipy, p_scipy = levene_scipy(*groups, center="median")
print(f"SciPy reference:         W={W_scipy:.4f}, p={p_scipy:.4f}")
print(f"F survival (SciPy f.sf): p={p_f:.4f}")

fig = plot_raw_vs_deviation(groups, names, details["centers"], center_label="median")
fig.update_layout(title="Example 1: raw samples vs absolute deviations")
fig.show()

fig = plot_mean_abs_deviation(details["z_groups"], names)
fig.show()
NumPy Levene (median center): W=1.6790, df=(2, 477), p_mc≈0.1864
SciPy reference:         W=1.6790, p=0.1877
F survival (SciPy f.sf): p=0.1877

Example 2: different variances#

Now the groups have similar means but different spreads. Levene’s test should typically reject \(H_0\).

n = 160
g1 = rng.normal(loc=0.0, scale=1.0, size=n)
g2 = rng.normal(loc=0.0, scale=2.0, size=n)
g3 = rng.normal(loc=0.0, scale=0.5, size=n)

groups = [g1, g2, g3]
names = ["sigma=1.0", "sigma=2.0", "sigma=0.5"]

W, p_mc, details = levene_test_numpy(*groups, center="median", n_mc=200_000, seed=2)
p_f = float(f_dist.sf(W, details["df1"], details["df2"]))

print(f"NumPy Levene (median center): W={W:.4f}, df=({details['df1']}, {details['df2']}), p_mc≈{p_mc:.6f}")
print(f"Analytic p (SciPy f.sf):     p={p_f:.6f}")

fig = plot_raw_vs_deviation(groups, names, details["centers"], center_label="median")
fig.update_layout(title="Example 2: raw samples vs absolute deviations")
fig.show()

fig = plot_mean_abs_deviation(details["z_groups"], names)
fig.show()

# A visual way to read the p-value: where does W fall under the null F distribution?
fig = plot_f_null_distribution(W, details["df1"], details["df2"], n=150_000, seed=3)
fig.show()
NumPy Levene (median center): W=95.0895, df=(2, 477), p_mc≈0.000000
Analytic p (SciPy f.sf):     p=0.000000

Why people often use the median center (robustness)#

The original Levene test uses the mean as the group center, but a common robust variant uses the median (often called the Brown–Forsythe test).

Intuition:

  • The mean is more sensitive to outliers and heavy tails.

  • Using the median tends to control false positives better when data are non-normal.

Below we simulate under \(H_0\) (equal variances) and compare how often each variant rejects at \(\alpha=0.05\).

from scipy.stats import f as f_dist


def pvalue_f(W: float, df1: int, df2: int) -> float:
    return float(f_dist.sf(W, df1, df2))


def estimate_type1_error(
    *,
    gen_groups,
    center: str,
    n_reps: int = 600,
    alpha: float = 0.05,
    seed: int = 123,
):
    rng_local = np.random.default_rng(seed)
    rejects = 0

    for _ in range(n_reps):
        groups = gen_groups(rng_local)
        W, details = levene_statistic(*groups, center=center)
        p = pvalue_f(W, details["df1"], details["df2"])
        rejects += int(p < alpha)

    return rejects / n_reps


k = 3
n = 30


def gen_normal(rng_local):
    return [rng_local.normal(loc=0.0, scale=1.0, size=n) for _ in range(k)]


def gen_t3_heavy_tails(rng_local):
    # Standard t(df=3) has variance df/(df-2)=3, so scale by 1/sqrt(3) to make variance ~1.
    return [rng_local.standard_t(df=3, size=n) / np.sqrt(3.0) for _ in range(k)]


rates = []
for dist_name, gen in [("Normal", gen_normal), ("t(df=3) heavy tails", gen_t3_heavy_tails)]:
    for center in ["mean", "median"]:
        rate = estimate_type1_error(gen_groups=gen, center=center, n_reps=700, alpha=0.05, seed=7)
        rates.append({"distribution": dist_name, "center": center, "rejection_rate": rate})

df_rates = pd.DataFrame(rates)
df_rates
          distribution  center  rejection_rate
0               Normal    mean        0.052857
1               Normal  median        0.040000
2  t(df=3) heavy tails    mean        0.067143
3  t(df=3) heavy tails  median        0.041429
fig = px.bar(
    df_rates,
    x="distribution",
    y="rejection_rate",
    color="center",
    barmode="group",
    text="rejection_rate",
    title="Type I error under H0 (how often we reject when variances are equal)",
)
fig.add_hline(
    y=0.05,
    line_dash="dash",
    annotation_text="alpha = 0.05",
    annotation_position="top left",
)
fig.update_traces(texttemplate="%{text:.3f}", textposition="outside")
fig.update_layout(yaxis_title="rejection rate", height=380)
fig.show()

Practical checklist + pitfalls#

  • Independence matters: Levene assumes observations are independent within and across groups.

  • Scale of measurement: works best for continuous/interval data; with discrete/ordinal outcomes interpret carefully.

  • What rejection means: at least one group variance differs → consider Welch’s ANOVA, robust methods, variance-stabilizing transforms, or modeling heteroscedasticity directly.

  • What non-rejection means: not enough evidence of variance differences ≠ proof of equal variances (power can be low for small samples).

  • Multiple comparisons: Levene is an omnibus test; if you follow with pairwise variance checks, correct for multiple testing.

Alternatives:

  • Bartlett’s test: more powerful under normality, but very sensitive to non-normal data.

  • Fligner–Killeen: nonparametric/robust test for equal variances.
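Both alternatives ship with SciPy; a quick sketch on simulated data (group sizes and seed are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Two groups with sigma=1 and one with sigma=2, so H0 is false here.
groups = [rng.normal(0.0, s, size=80) for s in (1.0, 1.0, 2.0)]

W_lev, p_lev = stats.levene(*groups, center="median")
W_bart, p_bart = stats.bartlett(*groups)
W_flig, p_flig = stats.fligner(*groups, center="median")

print(f"Levene:          W={W_lev:.3f}, p={p_lev:.4f}")
print(f"Bartlett:        T={W_bart:.3f}, p={p_bart:.4f}")
print(f"Fligner-Killeen: X2={W_flig:.3f}, p={p_flig:.4f}")
```

Bartlett's statistic is compared to a chi-squared reference, so its statistic is on a different scale than Levene's F-based W; only the p-values are directly comparable.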

Exercises#

  1. Modify Example 2 so only one group has a larger variance. How does \(W\) change?

  2. Increase/decrease sample sizes and see how the p-value behaves.

  3. Implement a simple “follow-up” routine: plot sample variances and bootstrap confidence intervals per group.

  4. Compare Levene (median center) to scipy.stats.fligner on heavy-tailed data.

References#

  • Levene, H. (1960). Robust tests for equality of variances.

  • Brown, M. B., & Forsythe, A. B. (1974). Robust tests for the equality of variances.

  • SciPy documentation: scipy.stats.levene